{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b9261e7f-2a1f-4df0-98f7-1b350c849736",
   "metadata": {},
   "source": [
    "## A01: Explore the data and choose the KPI"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02483f8b-9aa3-4986-98ae-dd7e0ba46907",
   "metadata": {},
   "source": [
    "### 1. Initialize the Spark session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c6320c4b-f038-469d-992d-76901a5737a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Locate the local Spark installation and make pyspark importable\n",
    "import findspark\n",
    "findspark.init()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "4ac6681b-77f5-4fd3-98a5-c689e1ca76d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "\n",
    "# Reuse the active session if one exists; otherwise create a new one\n",
    "spark = SparkSession.builder.appName(\"income_data\").getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e62b348a-b4d6-4e71-b6c2-cdbb022c9c71",
   "metadata": {},
   "source": [
    "### 2. Read the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ee962a9b-3a28-484d-9086-8a9fd98242bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_dir = \"./Landing Zone/income\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "468b2ba9-5ea0-4097-bcb3-62b6e3694554",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the CSV data under data_dir; the first row is the header and column types are inferred\n",
    "df = spark.read.csv(data_dir, header=True, inferSchema=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b7f68e6-8b04-493c-94ff-cb69cb791808",
   "metadata": {},
   "source": [
    "### 3. Show the DataFrame contents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "284d3017-e4a4-4ea5-9408-6549d3c10de1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----+--------------+--------------+----------+--------------------+--------+-------------------------+\n",
      "| Any|Codi_Districte| Nom_Districte|Codi_Barri|           Nom_Barri|Població|Índex RFD Barcelona = 100|\n",
      "+----+--------------+--------------+----------+--------------------+--------+-------------------------+\n",
      "|2007|             1|  Ciutat Vella|         1|            el Raval|   46595|                     64.7|\n",
      "|2007|             1|  Ciutat Vella|         2|      el Barri Gòtic|   27946|                     86.5|\n",
      "|2007|             1|  Ciutat Vella|         3|      la Barceloneta|   15921|                     66.7|\n",
      "|2007|             1|  Ciutat Vella|         4|Sant Pere, Santa ...|   22572|                     80.2|\n",
      "|2007|             2|      Eixample|         5|       el Fort Pienc|   31521|                    107.9|\n",
      "|2007|             2|      Eixample|         6|  la Sagrada Família|   52185|                    101.8|\n",
      "|2007|             2|      Eixample|         7|la Dreta de l'Eix...|   42504|                    137.6|\n",
      "|2007|             2|      Eixample|         8|l'Antiga Esquerra...|   41413|                    126.5|\n",
      "|2007|             2|      Eixample|         9|la Nova Esquerra ...|   58146|                    116.9|\n",
      "|2007|             2|      Eixample|        10|         Sant Antoni|   37988|                    103.8|\n",
      "|2007|             3|Sants-Montjuïc|        11|        el Poble Sec|   39579|                     73.3|\n",
      "|2007|             3|Sants-Montjuïc|        12|la Marina del Pra...|    1005|                     80.4|\n",
      "|2007|             3|Sants-Montjuïc|        13|   la Marina de Port|   29327|                     80.2|\n",
      "|2007|             3|Sants-Montjuïc|        14|la Font de la Gua...|   10064|                     90.4|\n",
      "|2007|             3|Sants-Montjuïc|        15|         Hostafrancs|   15771|                     82.7|\n",
      "|2007|             3|Sants-Montjuïc|        16|          la Bordeta|   18592|                     81.9|\n",
      "|2007|             3|Sants-Montjuïc|        17|       Sants - Badal|   24085|                     85.9|\n",
      "|2007|             3|Sants-Montjuïc|        18|               Sants|   40272|                     89.5|\n",
      "|2007|             4|     Les Corts|        19|           les Corts|   46400|                    130.4|\n",
      "|2007|             4|     Les Corts|        20|la Maternitat i S...|   23938|                    127.9|\n",
      "+----+--------------+--------------+----------+--------------------+--------+-------------------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a953a796-4a23-4c49-9152-574601003f5f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "811"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97270dbf-ad8e-4b39-9e1a-50b9787949f6",
   "metadata": {},
   "source": [
    "### 4. Show the DataFrame schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "613569dd-72c5-42d8-8f1d-9b69202124c6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- Any: integer (nullable = true)\n",
      " |-- Codi_Districte: integer (nullable = true)\n",
      " |-- Nom_Districte: string (nullable = true)\n",
      " |-- Codi_Barri: integer (nullable = true)\n",
      " |-- Nom_Barri: string (nullable = true)\n",
      " |-- Població: integer (nullable = true)\n",
      " |-- Índex RFD Barcelona = 100: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a61b7f05-7ab4-4ac9-a3d3-ba3957cff84f",
   "metadata": {},
   "source": [
    "### 5. Dataset summary"
   ]
  },
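  {
   "cell_type": "markdown",
   "id": "d76c4186-057a-44a6-879a-5974efa12c39",
   "metadata": {},
   "source": [
    "This dataset describes how family income is distributed across the districts and neighborhoods of Barcelona. It has 811 rows; each row gives the population (`Població`) and the income index (`Índex RFD Barcelona = 100`) of one neighborhood (`Nom_Barri`) in one year (`Any`). As the column name indicates, the index is expressed relative to a Barcelona-wide value of 100. The index column was inferred as a string rather than a number, so it must be cast to a numeric type before any aggregation."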
   ]
  },
  {
   "cell_type": "markdown",
   "id": "958d2eb8-e99d-422a-8022-2d67344165e1",
   "metadata": {},
   "source": [
    "### 6. Choose the KPI"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c54dac2-067d-4062-b931-21e79aecf088",
   "metadata": {},
   "source": [
    "Based on this dataset, the following analysis objectives could be considered:\n",
    "\n",
    "- Trend analysis: examine how population and the income index change over the years in each district.\n",
    "- Income index distribution: visualize the distribution of the income index in each year with histograms or box plots.\n",
    "- Heatmap visualization: use heatmaps to show changes in population and the income index over time across districts."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
